Abstract

The bioinformatical methods to detect lateral gene transfer events are mainly based on functional
coding DNA characteristics. In this paper, we propose the use of DNA traits not depending on protein
coding requirements. We introduce several semilocal variables that depend on DNA primary sequence
and that reflect thermodynamic as well as physico-chemical magnitudes that are able to tell apart the
genomes of different organisms. After combining these variables in a neural classifier, we obtain
results whose power of resolution goes as far as detecting the exchange of genomic material between
bacteria that are phylogenetically close.



I. Introduction 

There is general agreement that horizontal gene transfer (HGT) is important in genome
evolution. To what degree is still a matter of debate. The discussion oscillates between two
extreme positions: on the one hand, the idea that the rate of transfer and its impact are of such a
magnitude as to be the "essence of phylogeny", at least for prokaryotic organisms [Doolittle, 1999]; on the
other hand, the view of researchers who hold that the role of HGT in evolution has been overestimated
[Kurland et al., 2003] and that, while a matter of interest, it is not relevant when compared to
other causes of genomic evolution such as paralog duplication and secondary gene losses. Most
probably, the real weight of HGT in evolution lies somewhere between these extremes.
Independently of the outcome of the discussion, there is general agreement that it is relevant
to detect events of HGT.

There are several proposed methods to detect HGT. They can be classified into four categories:
deviant composition, anomalous phylogenetic distribution, abnormal sequence similarity and
incongruent phylogenetic trees [Eisen, 2000], [Ragan, 2001], [Philippe and Douady, 2003].



Deviant composition methods are based on the different phenotypic characteristics among divergent
genomes. They mainly focus on bias in GC content or codon usage [Mrazek and Karlin, 1999]
and on bias in the nucleotide composition of the first and third codon positions [Lawrence and Ochman, 1997].
Deviant genes might exist for reasons other than HGT, and only recently transferred genes would
be detected by this method [Eisen, 2000], [Hooper and Berg, 2002]. Also, this group of methods
normally does not detect genes transferred from phenotypically similar genomes.

Anomalous phylogenetic distribution is based on the identification of homologous genes shared
by genomes in disjunct phylogenetic lineages and their absence in close relatives (in one or both
lineages). However, polyphyletic gene losses and rapid sequence divergence can mislead the
identification of HGT [Eisen, 2000].



Abnormal sequence similarity is based on the assumption that overall similarity is a measure
of phylogenetic relatedness. Usually, BLAST searches (or other similar algorithms) are used to
detect sequences in one genome that are more similar to sequences in divergent genomes than to
those found in phylogenetically closer genomes (the phylogenetic relationships between genomes
are set prior to the analysis according to some other criterion, such as rRNA phylogeny). While these
methods are fast, they are not fully reliable because the similarity between a gene in two
different species can be explained by a number of phenomena besides HGT. For instance,
evolutionary rate variation can lead to misleading results in the identification of HGT genes,
both as false positives and false negatives [Eisen, 2000].

Phylogenetic analysis is often considered to be the best way to investigate the occurrence
of HGT because it remains the only one to reliably infer historical events from gene
sequences [Eisen, 2000]. Accordingly, incongruent phylogenetic trees between different families
of genes are taken as evidence of HGT; however, conflicting phylogenies can be the result of artifacts
of phylogenetic reconstruction, HGT or unrecognized paralogy [Zhaxybayeva et al., 2004].

In this study, we propose a method that does not depend on DNA functional traits. For
this reason, it is no longer correct to say that it detects HGT, because it might also detect the transfer of
non-coding DNA (ncDNA). From now on, we will refer to horizontal genetic material transfer
(HGMT). We use eight variables that can be measured over a set of windows covering whole
genomes and combine them with an Artificial Neural Network (NN). The use of DNA
measurables other than those derived from protein coding requirements has been largely ignored.
To our knowledge, there are no methods based exclusively on structural DNA traits to detect
either HGT or HGMT events.

II. Methods 

Our approach was to take a pair of DNA sequences (genomes in the case of prokaryotes, chromosomes
for eukaryotes); the first one is the donor genome and the second one the acceptor.
The data were taken from GenBank release 24.

To calculate the variables that locally characterize the DNA sequences, a window is placed
over the chromosomes (Figure 1). The window can slide over the sequences or can be placed
randomly (see below).



Fig. 1. A window is placed over the DNA sequence. At every position, some primary DNA sequence variables (x_1, x_2, ..., x_n)
are calculated. For the NN prediction stage, this window slides along the DNA sequence (see text).

We used a "classic NN approach": a backpropagation multilayer perceptron (MLP), very
similar to the model reported by Uberbacher [Uberbacher et al., 1991] (Figure 2).

The novelty and main contribution of this paper is the set of measurables we use to evaluate
a DNA sequence. Following the nomenclature of Uberbacher, we will call them sensors.

Fig. 2. The sensors are mixed in a Multilayer Perceptron. The output can be zero or one depending on the class to which the
DNA sequence belongs. The MLP in the figure does not necessarily have the architecture used in the study.



For the training stage of the MLP, a window of fixed length (300bp unless otherwise stated)
was placed repeatedly over both genomes at random independent positions. Each time, eight
sensors were evaluated over the subsequence in the window and were used to feed and train an MLP
with binary output: '0' corresponding to acceptor DNA sequences and '1' to donor ones. For the
prediction stage, the window was allowed to slide along the acceptor sequence and a plot of its
position against the outcome was obtained.
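
The two ways of placing the window can be sketched as follows. This is only an illustrative sketch and not the code used in the study: the function names, the fixed random seeds, the default step of 30 bases and the `sensors` callable (any function mapping a subsequence to a list of numbers) are all assumptions made for the example.

```python
import random

def random_windows(seq, n, width=300, seed=0):
    """Return n (start, subsequence) pairs taken at independent random positions."""
    if len(seq) < width:
        raise ValueError("sequence is shorter than the window")
    rng = random.Random(seed)
    starts = [rng.randint(0, len(seq) - width) for _ in range(n)]
    return [(s, seq[s:s + width]) for s in starts]

def sliding_windows(seq, width=300, step=30):
    """Return (start, subsequence) pairs for a window advancing `step` bases at a time."""
    return [(s, seq[s:s + width]) for s in range(0, len(seq) - width + 1, step)]

def training_examples(acceptor, donor, n_per_class, sensors, width=300):
    """Label acceptor windows 0 and donor windows 1, as described in the text.

    `sensors` is a hypothetical callable supplied by the caller.
    """
    examples = []
    for label, genome in ((0, acceptor), (1, donor)):
        for _, sub in random_windows(genome, n_per_class, width, seed=label):
            examples.append((sensors(sub), label))
    return examples
```

For prediction, `sliding_windows(acceptor)` would supply the subsequences whose classifier output is plotted against the start position.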



A. Definition of the sensors 

We worked with a total of eight sensors divided into three groups. The first one includes
traditional measures of DNA variance, the second reflects the local correlation structure of DNA
and the third one is a measure of the DNA spatial structure according to the dinucleotide
distribution:

1) This group comprises CG and CpG contents. There are a number of publications reporting
the bias of these measures among different organisms. CpG is well known to tell apart
prokaryal and eukaryal lineages [Shimizu et al., 1997]. Even if the underlying reasons are
still unclear [Wang et al., 2004], it gives a good first clue.

2) In 1995, we proposed an index of DNA heterogeneity that disclosed different styles of
genomic structural organization [Miramontes et al., 1995]. Given a DNA sequence, it is
translated into the three possible binary sequences using the groupings purine-pyrimidine
(YR), weak-strong (WS) and amino-keto (MK). Then the following index is calculated over
each binary derivative:

$$ d = \frac{N_{00} N_{11} - N_{10} N_{01}}{N_{0} N_{1}} \qquad (1) $$

where N_ij stands for the number of i bases followed by the j base, and i and j are zero
or one. The phenomenology behind this index is discussed in [Miramontes et al., 1995].



3) In 1992 the group of R.E. Dickerson [Quintana, 1992] reported the variability in the DNA
structural angles depending on the dinucleotide steps. Their results can be summarized in
Table ?? (page 345 of the cited reference).
H stands for high twist profile steps for two stacked consecutive base pairs along the
double helix. They are characterized by high twist, positive cup and negative roll
angles. The parameters L, I, and V stand for Low, Intermediate and Variable twist.
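
The sensors of the first two groups can be sketched as below. This is a hedged illustration rather than the authors' implementation. Several points are assumptions: CpG content is normalized here by the number of dinucleotide steps, N_0 and N_1 in the index d are taken to be the counts of zeros and ones in the binary sequence, and d is set to 0.0 when one symbol is absent. The assignment of 0 and 1 within each grouping does not matter, because the expression for d is unchanged when the two symbols are swapped. The twist-class sensors of the third group are omitted because the table of dinucleotide steps is not reproduced here.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_content(seq):
    """Fraction of dinucleotide steps that are CpG (normalization is an assumption)."""
    seq = seq.upper()
    steps = len(seq) - 1
    return sum(seq[i:i + 2] == "CG" for i in range(steps)) / steps

# Bases mapped to 1 in each binary derivative; all other bases map to 0.
GROUPINGS = {
    "YR": "AG",  # purines vs. pyrimidines
    "WS": "CG",  # strong vs. weak
    "MK": "AC",  # amino vs. keto
}

def d_index(seq, ones):
    """Index d = (N00*N11 - N10*N01) / (N0*N1) over one binary derivative."""
    bits = [1 if base in ones else 0 for base in seq.upper()]
    n = [[0, 0], [0, 0]]               # n[i][j]: symbol i followed by symbol j
    for a, b in zip(bits, bits[1:]):
        n[a][b] += 1
    n1 = sum(bits)                     # assumed meaning of N_1
    n0 = len(bits) - n1                # assumed meaning of N_0
    if n0 == 0 or n1 == 0:
        return 0.0                     # d is undefined here; 0.0 is a choice of this sketch
    return (n[0][0] * n[1][1] - n[1][0] * n[0][1]) / (n0 * n1)

def first_five_sensors(seq):
    """CG content, CpG content and the three d indices for one window."""
    return [gc_content(seq), cpg_content(seq)] + [
        d_index(seq, ones) for ones in GROUPINGS.values()
    ]
```

As a small check of the sign convention, `d_index("AACC", "A")` is positive (like symbols cluster) while `d_index("ACAC", "A")` is negative (symbols alternate).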

III. Results 

One of the open problems in designing an MLP for pattern detection is setting the number of
neurons in the hidden layer (the number of neurons in the output layer is determined by the
number of classes to classify). To find the configuration best suited to our purposes, we ran several
configurations and tested the resulting output with an artificial problem: to detect a fragment
of E. coli (donor) inserted in silico in a mouse (Mus musculus) chromosome (acceptor). To this
end, a set of 20000 fragments of length 300 from both genomes was picked randomly to train
the MLP. The network configuration 8-5-1 yielded the results shown in Figure 3. With a 300bp
sliding window and an overlap of 30bp, the response of the MLP is steadily '0' while the sliding
window travels across the acceptor chromosome, then jumps to '1' when it enters the donor
insertion and goes back to '0' for the rest of the acceptor sequence.
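
To make the 8-5-1 configuration concrete, the following is a minimal pure-Python sketch of a perceptron with eight inputs, five hidden neurons and one output, trained by online backpropagation. It is not the network used in the study: the sigmoid activations, squared-error loss, learning rate, number of epochs and weight initialization are assumptions, and the inputs are assumed to be scaled to order one.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return hidden activations and the single output for one input vector."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    out = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2[0])
    return h, out

def train_mlp(X, y, hidden=5, lr=0.5, epochs=100, seed=0):
    """Train an n_in-hidden-1 perceptron with online backpropagation."""
    rng = random.Random(seed)
    n_in = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = [0.0]
    params = (W1, b1, W2, b2)
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for k in order:
            x, target = X[k], y[k]
            h, out = forward(params, x)
            # Gradient of the squared error through the output sigmoid.
            d_out = (out - target) * out * (1.0 - out)
            for j in range(hidden):
                # Hidden-layer error, computed before W2[j] is updated.
                d_h = d_out * W2[j] * h[j] * (1.0 - h[j])
                W2[j] -= lr * d_out * h[j]
                b1[j] -= lr * d_h
                for i in range(n_in):
                    W1[j][i] -= lr * d_h * x[i]
            b2[0] -= lr * d_out
    return params

def classify(params, x):
    """Binary output: 0 for acceptor-like input, 1 for donor-like input."""
    return 1 if forward(params, x)[1] >= 0.5 else 0
```

Changing `hidden` is how alternative configurations, such as those compared in the text, would be tried in this sketch.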



Fig. 3. Mus musculus chromosome 1 (acceptor) on the horizontal axis (position along the genome) with a sequence of E. coli (donor) inserted in the middle.
The ordinate can only be '0' or '1' depending on whether the MLP classifies the sequence as acceptor or as donor.
The horizontal scale is arbitrary.

Once the MLP was well tuned, we carried out three case studies: 

1) To detect a prokaryal insertion in a prokaryal genome. In this case we selected, for no
particular reason, the genome of Archaeoglobus fulgidus as the acceptor genome and
Pseudomonas aeruginosa as the donor species in a second in silico experiment. Figure 4
shows that the results are good enough to encourage going one step further.



Fig. 4. Archaeoglobus fulgidus (acceptor) on the horizontal axis (position along the genome) with a sequence of Pseudomonas aeruginosa (donor) inserted in
the middle. The ordinate can only be '0' or '1' depending on whether the MLP classifies the sequence as acceptor
or as donor.

2) The next attempt is the detection of a real horizontal gene transfer already reported in the
literature [Kroll, 1998]. It is a gene (Cu-Zn-superoxide dismutase) of Haemophilus ducreyi
inserted in the Neisseria meningitidis genome. Figure 5 shows a clear detection.



Fig. 5. Neisseria meningitidis (acceptor) on the horizontal axis (position along the genome) with a sequence of Haemophilus ducreyi (donor) inserted in the
middle. The ordinate can only be '0' or '1' depending on whether the MLP classifies the sequence as acceptor or
as donor. The horizontal thick line emphasizes the region of insertion.



3) The last example is a case of organelle to nuclear genome transfer. Figure 6 shows chromosome
2 of Arabidopsis thaliana as the acceptor sequence for its own mitochondrion. Since the
horizontal scale unit is 30bp, the inserted fragment spans units 107,653 to 116,985, which
correspond approximately to nucleotides 3,230,000 to 3,510,000. There is thus
a clear inserted fragment of the order of 270kb, which coincides with the unexpected
case of an organelle to nuclear transfer event [Xiaoying et al., 1999]. This case is important
for our claims because more than sixty percent of the Arabidopsis
thaliana mitochondrial genome is not translated into amino acids [Unseld et al., 1997].
Notwithstanding, our method clearly detects the transfer.
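
The conversion from horizontal scale units to nucleotide positions quoted in the third case study can be checked with a short calculation. The function name is hypothetical; the only inputs are the 30bp scale unit and the two unit positions given in the text.

```python
STEP_BP = 30  # horizontal scale unit stated in the text

def scale_unit_to_bp(index, step=STEP_BP):
    """Convert a position in horizontal scale units to a nucleotide position."""
    return index * step

start_bp = scale_unit_to_bp(107_653)   # 3,229,590
end_bp = scale_unit_to_bp(116_985)     # 3,509,550
length_bp = end_bp - start_bp          # 279,960

print(start_bp, end_bp, length_bp)
```

The detected span is therefore about 280kb, the same order as the 270kb fragment cited.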



IV. Discussion 



In this paper we show the feasibility of using variables not related to DNA function as
measurables to detect horizontal exchange of genetic material between different
species. The results are very good and encourage the further development of this line of research.
It is a matter of future work to build a complete set of variables of minimum size but
with maximum power of resolution.

Fig. 6. Arabidopsis thaliana (acceptor) on the horizontal axis (position along the genome) with a sequence of its own mitochondrial genome (donor) inserted
in the middle. The horizontal scale unit is 30bp.

This paper also contributes to clarifying the biology behind the phenomenon of horizontal gene
transfer. The degree of amelioration of horizontally transferred genes is somehow
related to the accuracy with which different methods detect xenologous genes. For instance, the sequencing
of chromosome 2 of Arabidopsis thaliana has revealed a large and unexpected organellar-to-nuclear
genetic material transfer event involving the mitochondrial genome (20). The sequence in the
nucleus is 99% identical to the mitochondrial genome, suggesting that the transfer event is very
recent. Therefore, amelioration has been negligible, and the MLP clearly detects the transferred
DNA. On the other hand, as shown in Figure 5, the method presented here effectively identifies
DNA that has been transferred from Haemophilus sp. (a Gamma-proteobacterium) to Neisseria
meningitidis (a member of the Beta-proteobacteria subdivision) [Kroll, 1998]. However, the
high degree of spreading of the points in Figure 5 suggests that mutations accumulated since the
horizontal gene transfer event have likely ameliorated sodC in N. meningitidis. There are 136
nucleotide differences between H. ducreyi sodC and the homologous gene from N. meningitidis.
If we assume equal rates of substitution in the two sequences, then approximately 68
new mutations have accumulated in each sequence since the horizontal transfer event (out of
561 nucleotides in N. meningitidis sodC). This is a substantial amount of change if we compare it
to the number of differences in 16S rRNA between the two species (83 differences accumulated
in each gene, out of 1544 nucleotides in N. meningitidis 16S rRNA) and if we take into account
that the transfer event must have happened after the divergence of the two species. The extent
to which the NN can go in detection when amelioration occurs will be reported elsewhere.
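
The comparison between sodC and 16S rRNA can be made explicit with the figures quoted in the Discussion. The sketch below only restates that arithmetic under the same equal-rates assumption; the variable and function names are illustrative.

```python
def per_lineage_changes(pairwise_differences):
    """Split pairwise differences evenly between two lineages (equal-rates assumption)."""
    return pairwise_differences / 2

sodc_changes = per_lineage_changes(136)    # 68 changes per sequence
sodc_fraction = sodc_changes / 561         # about 0.121 of the sodC positions
rrna_fraction = 83 / 1544                  # about 0.054 of the 16S rRNA positions
ratio = sodc_fraction / rrna_fraction      # about 2.25

print(round(sodc_fraction, 3), round(rrna_fraction, 3), round(ratio, 2))
```

Under these numbers, sodC has changed at roughly twice the per-site proportion of 16S rRNA, even though the transfer must postdate the divergence of the two species.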